Clustering and Relational Ambiguity: from Text Data to Natural Data
Abstract
Text data is often seen as “take-away” material with little noise and easy-to-process information. The main questions are how to obtain the data and how to transform them into a good document format. But data can be sensitive to noise, often called ambiguity. Ambiguities have long been recognized, mainly because polysemy is obvious in language and context is required to remove uncertainty. One claim of this paper is that syntactic context is not sufficient to improve interpretation. The paper tries to explain that, first, noise can come from natural data themselves, even when high technology is involved, and, second, texts that are seen as verified but are meaningless can spoil the content of a corpus, which may lead to contradictions and background noise. We used a set of papers in biology to identify ambiguous facts related to human interpretation, and a nearest-neighbour display associated with a Zipfian distribution to compare the structural content of a corpus. Four kinds of discourse have been studied: technical, general, short-communication and artificial.
Keywords: computational linguistics; paradox; contradiction; ambiguity; semantic relationship; domain; information extraction; corpus
INTRODUCTION
Human cognition refers to a diversity of concepts such as memory and brain anatomy, inference and reasoning, motivation, time and space, classification and clustering. Inference tries to identify good relations or properties associated with an object. In this sense it is also possible to test the validity or consistency of a relation. Let P be the proposition “a cat is a stone”. P is false or contradictory because a stone is not a living organism, whereas a cat is one. P can be called paradoxical or contradictory. Sometimes society lives with contradictions, such as tolerance of many deaths on roads or in wars but intolerance of deaths from disease. In this paper we focus more specifically on sources of potential contradictions that could spoil the computation of information extraction.
Formal semantics is concerned with validating relations between a set of objects. Our study focuses on issues in managing the complexity of a logical proposition and how to compute its truth value, but we also study how to extract relations and see how they can be asserted as non-contradictory with regard to relations extracted from other texts. Thus texts are the primary material of discussion. Chapter 1 presents the relational ambiguities we can find in text. We start by presenting a typology of logical relations. Given a type of relation, we explain how to extract such relations with markers in text files. But markers are not sufficient to detect a contradiction. A specialized language, such as that of a molecular corpus, provides an example of ambiguous (contradictory) relations that cannot be detected with markers. Hence, we show that a global overview of word collocations in a corpus can give a good signal about its structure. In our new “publish or perish” era of research and development, the production of literature is high, but a non-negligible percentage of papers become false over time. It is possible to compile from the...
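The marker-based extraction step described above, and its limits, can be illustrated with a minimal sketch. The two marker patterns (`is_a`, `is_not_a`) and the function names are hypothetical and chosen only for illustration; the paper's own marker inventory and typology of relations are not reproduced here.

```python
import re

# Hypothetical surface markers for two relation types (assumption, not the
# paper's inventory): "a X is a Y" and "a X is not a Y".
MARKERS = {
    "is_a": re.compile(
        r"\b(?:an?|the)\s+(\w+)\s+is\s+(?:an?|the)\s+(\w+)", re.IGNORECASE
    ),
    "is_not_a": re.compile(
        r"\b(?:an?|the)\s+(\w+)\s+is\s+not\s+(?:an?|the)\s+(\w+)", re.IGNORECASE
    ),
}


def extract_relations(text):
    """Return (subject, relation, object) triples found by the markers."""
    triples = []
    # Naive sentence split on end-of-sentence punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for relation, pattern in MARKERS.items():
            for subject, obj in pattern.findall(sentence):
                triples.append((subject.lower(), relation, obj.lower()))
    return triples


def surface_conflicts(triples):
    """Return (subject, object) pairs that are both asserted and denied."""
    asserted = {(s, o) for s, r, o in triples if r == "is_a"}
    denied = {(s, o) for s, r, o in triples if r == "is_not_a"}
    return sorted(asserted & denied)


if __name__ == "__main__":
    corpus = "A cat is a stone. A cat is an animal. A stone is not an animal."
    triples = extract_relations(corpus)
    print(triples)
    print(surface_conflicts(triples))
```

On this toy corpus the markers recover all three relations, yet `surface_conflicts` returns an empty list: the contradiction in “a cat is a stone” only appears once the relations are combined with background knowledge (a cat is an animal, a stone is not), which is consistent with the claim that markers alone are not sufficient to detect a contradiction.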
Similar resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
Relational Text Mining and Visualization
Discovering hidden patterns in distributed heterogeneous textual databases and unstructured data is a new challenge in data mining. Traditional data mining often assumes that preprocessing is already done, so that homogeneous data are available at the needed level. For distributed heterogeneous textual data this is not the case. Complex relations between items/entities (e.g., relations between people i...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text motivated researchers to use...
Natural scene text localization using edge color signature
Localizing text regions in images taken from natural scenes is one of the challenging problems due to variations in font, size, color and orientation of text. In this paper, we introduce a new concept, the so-called Edge Color Signature, for localizing text regions in an image. This method is able to localize both Farsi and English texts. In the proposed method, first a pyramid using diff...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on English document clustering has been carried out. The question now arises: can its results be extended to other languages? Since the goal of document clustering is grouping of docum...
A Hybrid Grey based Two Steps Clustering and Firefly Algorithm for Portfolio Selection
Considering the concept of clustering, the main idea of the present study is based on the fact that all stocks for choosing and ranking will not be necessarily in one cluster. Taking the mentioned point into account, this study aims at offering a new methodology for making decisions concerning the formation of a portfolio of stocks in the stock market. To meet this end, Multiple-Criteria Decisi...
Journal title: JDMDH
Volume: 2014
Publication date: 2014